Hypothesis

Non-linear models will be more accurate than a logistic regression model


In [1]:
# Load data
import pandas as pd
with open('./data_files/8lWZYw-u-yNbGBkC4B--ip77K1oVwwyZTHKLeD7rm7k.csv') as data_file:
    df = pd.read_csv(data_file)
df.head()


Out[1]:
Subject Id ConversationId Importance SentDateTime Body CcRecipients Sender ToRecipients FolderId
0 OPGIdentity Requests Quarterly Review and 6 ot... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-27T11:06:17Z Logocidagendaicon Your agenda for Mond... NaN no-reply@microsoft.com dastrock@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm...
1 Throttling alerts from Gateway AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-21T20:25:49Z The monitor that will create these alerts has ... NaN akina@microsoft.com msodsswat@microsoft.com;estsincident@microsoft... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm...
2 App API Scrum Monday Series and 4 other event... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-20T11:11:30Z Logocidagendaicon Your agenda for Mond... NaN no-reply@microsoft.com dastrock@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm...
3 Throttling alerts from Gateway AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-17T22:39:56Z Description Description Description... NaN akina@microsoft.com msodsswat@microsoft.com;estsincident@microsoft... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm...
4 Notification AAD certificate roll in March for... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-15T19:15:21Z cidimage001png01D16D8184C55A30cidimage003j... NaN shiung.yong@microsoft.com aadpartnersnotify@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm...

Comparing classification models


  • Do some preprocessing on the text columns (subject, body, maybe to, cc, from)
    • Clean NaN's or remove rows of data with NaNs
    • Replicate what the Azure Preprocess Text module does for us (stop-word removal, etc.)
    • Use scikit learn where possible
  • Do some feature construction using pandas & scikit learn
    • On subject, to, cc, from
    • Bag of words
    • TF/IDF
  • One-Hot Encode FolderId labels into their own boolean columns (1s & 0s)
  • Ignore better features for now; this is good enough for comparisons
  • Split data into training & test sets to be used for all ensemble members
  • For each classifier, train a model on the training data
  • Evaluate performance of model on test data, compare to Logistic Regression model
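
The one-hot encoding step in the plan above is not carried out later in this notebook (scikit-learn classifiers accept string class labels directly in fit). For tools that do need numeric targets, a minimal sketch might look like the following. The folder names are hypothetical stand-ins for the long opaque FolderId strings.

```python
import pandas as pd

# Hypothetical folder labels; the real FolderId values are long opaque strings.
folder_ids = pd.Series(['inbox', 'alerts', 'inbox', 'calendar'], name='FolderId')

# One 0/1 column per distinct folder, one row per message.
one_hot_labels = pd.get_dummies(folder_ids).astype(int)

print(list(one_hot_labels.columns))
print(one_hot_labels.values.tolist())
```

Each row contains exactly one 1, in the column of the folder that message belongs to.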

Constructing Subject Feature Matrix



In [2]:
# Remove messages without a Subject
print df.shape
df = df.dropna(subset=['Subject'])
print df.shape


(10301, 10)
(10295, 10)

In [3]:
# Perform bag of words feature extraction
# TODO: Why are there only 3000 words in the vocabulary?
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(stop_words='english', lowercase=True)
train_counts = count_vect.fit_transform(df['Subject'])
print 'Dimensions of vocabulary feature matrix are:'
print train_counts.shape


Dimensions of vocabulary feature matrix are:
(10295, 3119)
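
On the TODO above: one plausible contributor to the small vocabulary is that subjects are short and heavily repeated, and that CountVectorizer with stop_words='english' drops common words (its default token pattern also ignores single-character tokens). A toy sketch, using two made-up subject lines rather than the real data:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical subject lines that share most of their words.
subjects = [
    'Throttling alerts from Gateway',
    'Accepted: Fixing the Gateway alerts',
]

toy_vect = CountVectorizer(stop_words='english', lowercase=True)
toy_counts = toy_vect.fit_transform(subjects)

# vocabulary_ maps each kept token to its column index.
toy_vocabulary = sorted(toy_vect.vocabulary_)
print(toy_vocabulary)
print(toy_counts.shape)
```

Nine words go in, but only five distinct tokens survive: 'from' and 'the' are stop words, and 'gateway' and 'alerts' each appear in both subjects.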

In [4]:
# Add TF/IDF weighting to account for length of documents
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
train_tfidf = tfidf_transformer.fit_transform(train_counts)
print 'Dimensions of vocabulary feature matrix are:'
print train_tfidf.shape
print 'But, it\'s a sparse matrix: ' + str(type(train_tfidf))


Dimensions of vocabulary feature matrix are:
(10295, 3119)
But, it's a sparse matrix: <class 'scipy.sparse.csr.csr_matrix'>
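
To make the weighting concrete, here is a hand-rolled sketch of what TfidfTransformer computes with its default settings (smooth_idf=True, norm='l2', sublinear_tf=False): each count is multiplied by idf = ln((1 + n) / (1 + df)) + 1, and each row is then scaled to unit length. The count matrix is a toy example, not the notebook's data.

```python
import math

# Toy term counts: 3 documents (rows) x 2 terms (columns).
counts = [
    [1, 1],
    [2, 0],
    [1, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: how many documents contain each term.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency.
idf = [math.log((1.0 + n_docs) / (1.0 + freq)) + 1.0 for freq in doc_freq]


def tfidf_row(row):
    """Weight one row of counts by idf, then scale it to unit L2 norm."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm else weighted


tfidf = [tfidf_row(row) for row in counts]
print(tfidf)
```

The second and third documents differ only in length (counts of 2 versus 1) and end up with identical rows, which is the length normalization at work. In the first document the rarer term receives the larger weight.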

Constructing CC, To, and From



In [5]:
# Merge CC, To, From into one People column
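# Assign the result back rather than calling fillna(inplace=True) on a column
# selection, which may operate on a copy and leave df unchanged
df['CcRecipients'] = df['CcRecipients'].fillna('')
df['ToRecipients'] = df['ToRecipients'].fillna('')
df['Sender'] = df['Sender'].fillna('')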
df['People'] = df['Sender'] + ';' + df['CcRecipients'] + ';' + df['ToRecipients']
df.head(10)


Out[5]:
Subject Id ConversationId Importance SentDateTime Body CcRecipients Sender ToRecipients FolderId People
0 OPGIdentity Requests Quarterly Review and 6 ot... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-27T11:06:17Z Logocidagendaicon Your agenda for Mond... no-reply@microsoft.com dastrock@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... no-reply@microsoft.com;;dastrock@microsoft.com
1 Throttling alerts from Gateway AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-21T20:25:49Z The monitor that will create these alerts has ... akina@microsoft.com msodsswat@microsoft.com;estsincident@microsoft... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... akina@microsoft.com;;msodsswat@microsoft.com;e...
2 App API Scrum Monday Series and 4 other event... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-20T11:11:30Z Logocidagendaicon Your agenda for Mond... no-reply@microsoft.com dastrock@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... no-reply@microsoft.com;;dastrock@microsoft.com
3 Throttling alerts from Gateway AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-17T22:39:56Z Description Description Description... akina@microsoft.com msodsswat@microsoft.com;estsincident@microsoft... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... akina@microsoft.com;;msodsswat@microsoft.com;e...
4 Notification AAD certificate roll in March for... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-15T19:15:21Z cidimage001png01D16D8184C55A30cidimage003j... shiung.yong@microsoft.com aadpartnersnotify@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... shiung.yong@microsoft.com;;aadpartnersnotify@m...
5 OpenID RP Certification Launch Announcement AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2017-02-14T22:40:30Z Part of the OpenID Foundation efforts to conti... oauth@microsoft.com;catpm@microsoft.com;mssts@... michael.jones@microsoft.com openid@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... michael.jones@microsoft.com;oauth@microsoft.co...
6 How to determine if an account is fully provis... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2016-09-06T21:31:42Z Hey guys I am Anbin from OneNote team I am ... pthiruv@exchange.microsoft.com;wenjenc@microso... anbinm@microsoft.com msareq@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... anbinm@microsoft.com;pthiruv@exchange.microsof...
7 Registering map platform component as an app f... AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2016-09-01T02:04:17Z Hey MsaReq I’m on the maps platform team an... icheck@microsoft.com msareq@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... icheck@microsoft.com;;msareq@microsoft.com
8 Accepted Fixing MSA Developer Requests AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2016-08-10T19:48:31Z wibartle@microsoft.com dastrock@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... wibartle@microsoft.com;;dastrock@microsoft.com
9 Accepted Fixing MSA Developer Requests AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... AAQkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... Normal 2016-08-10T19:47:41Z adfrei@microsoft.com dastrock@microsoft.com AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZm... adfrei@microsoft.com;;dastrock@microsoft.com

In [6]:
# Convert People to matrix representation
people_features = df['People'].str.get_dummies(sep=';')
print people_features.shape
people_features.head()


(10295, 3530)
Out[6]:
11franklinc@gmail.com _ram@microsoft.com a-amgeo@microsoft.com a-asokuy@microsoft.com a-barak@microsoft.com a-bewhi@microsoft.com a-libren@microsoft.com a-markr@microsoft.com a-midumi@microsoft.com a-pakhar@microsoft.com ... zideng@microsoft.com zihliu@microsoft.com zion.brewer@microsoft.com zizhong@microsoft.com zlatkom@exchange.microsoft.com zoinertejada@solliance.net zoltanp@exchange.microsoft.com zorauf@microsoft.com zsolt.zombik@zsoltzombik.com zunqwang@microsoft.com
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 3530 columns


In [7]:
# Will need to store people vocabulary for feature construction during predictions
people_vocabulary = people_features.columns
print people_vocabulary[:2]
print len(people_vocabulary)


Index([u'11franklinc@gmail.com', u'_ram@microsoft.com'], dtype='object')
3530
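
A minimal sketch of how the stored vocabulary might be reused at prediction time: encode the new message with get_dummies, then reindex its columns against the training vocabulary so the feature layout matches. The addresses and the names stored_vocabulary and new_features are hypothetical.

```python
import pandas as pd

# Training-time step, as in the cells above, on hypothetical addresses.
train_people = pd.Series([
    'a@example.com;;b@example.com',
    'c@example.com;b@example.com',
])
stored_vocabulary = train_people.str.get_dummies(sep=';').columns

# Prediction-time step: encode a new message, then align its columns
# to the stored vocabulary. Unseen addresses are dropped and missing
# ones are filled with 0.
new_people = pd.Series(['b@example.com;;z@example.com'])
new_features = (
    new_people.str.get_dummies(sep=';')
    .reindex(columns=stored_vocabulary, fill_value=0)
)

print(list(new_features.columns))
print(new_features.values.tolist())
```

The unseen address z@example.com contributes nothing, so a message made up entirely of unknown people becomes an all-zero row.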

In [8]:
# Convert to csr_matrix and hstack with Subject feature matrix
import scipy.sparse  # a bare "import scipy" is not guaranteed to load the sparse submodule
sparse_people_features = scipy.sparse.csr_matrix(people_features)
print people_features.shape
print sparse_people_features.shape


(10295, 3530)
(10295, 3530)

In [9]:
print sparse_people_features.shape
print train_tfidf.shape
feature_matrix = scipy.sparse.hstack([sparse_people_features, train_tfidf])
print feature_matrix.shape


(10295, 3530)
(10295, 3119)
(10295, 6649)
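
The storage format that scipy.sparse.hstack returns when format is left unset can vary with the SciPy version and the input types. If row slicing matters later (for example when splitting into train and test sets), requesting CSR explicitly is one option. A small sketch on toy matrices:

```python
import numpy as np
import scipy.sparse

# Two toy blocks with the same number of rows.
left = scipy.sparse.csr_matrix(np.array([[1, 0], [0, 1], [1, 1]]))
right = scipy.sparse.csr_matrix(np.array([[0.5], [0.0], [0.25]]))

# Ask for CSR explicitly instead of relying on the default format.
csr_stack = scipy.sparse.hstack([left, right], format='csr')

print(csr_stack.shape)
print(csr_stack.format)
```

The integer block is upcast so that both blocks share one floating-point dtype in the result.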

Train models & compare accuracies



In [10]:
# Split into test and training data sets
from sklearn.model_selection import train_test_split
labels_train, labels_test, features_train, features_test = train_test_split(df['FolderId'], feature_matrix, test_size=0.20, random_state=42)
print labels_train.shape
print labels_test.shape
print features_train.shape
print features_test.shape


(8236,)
(2059,)
(8236, 6649)
(2059, 6649)
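
The order of the unpacked names matters here: train_test_split returns a train/test pair for each input, in the order the inputs were passed, so passing labels first yields the labels first. A toy check, with made-up arrays, that the shapes and row alignment come out as expected:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Row i of features is [2i, 2i + 1]; its label is 'a' for even i, 'b' for odd i.
labels = np.array(['a', 'b'] * 5)
features = np.arange(20).reshape(10, 2)

labels_train, labels_test, features_train, features_test = train_test_split(
    labels, features, test_size=0.20, random_state=42)

print(labels_train.shape)
print(features_test.shape)

# Recover each test row's original index and confirm its label still matches.
aligned = all(
    label == labels[row[0] // 2]
    for label, row in zip(labels_test, features_test)
)
print(aligned)
```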

In [11]:
# Construct a list of classifiers
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

names = [
    "Nearest Neighbors", 
    "Linear SVM", 
    "Decision Tree", 
    "Random Forest", 
    "Neural Net", 
    "AdaBoost",
]

candidate_classifiers = [
    KNeighborsClassifier(),
    SVC(kernel='linear', C=0.025),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
]

In [12]:
# Train and evaluate models, compare accuracy
from sklearn import metrics
for name, clf in zip(names, candidate_classifiers):
    model = clf.fit(features_train, labels_train)
    predictions = model.predict(features_test)
    print name + ": " + str(metrics.accuracy_score(labels_test, predictions))


Nearest Neighbors: 0.885381253035
Linear SVM: 0.797474502186
Decision Tree: 0.175327829043
Random Forest: 0.078678970374
Neural Net: 0.855269548324
AdaBoost: 0.0806216610005
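
The logistic regression baseline that these numbers are meant to be compared against is not trained in this notebook. A minimal sketch of how one could be evaluated with the same feature steps follows, assuming scikit-learn's LogisticRegression with default settings and a hypothetical six-message corpus (the subjects and folder names are invented for illustration).

```python
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical subjects and folder labels.
train_subjects = [
    'Throttling alerts from Gateway',
    'Gateway throttling incident',
    'Gateway alerts resolved',
    'Your agenda for Monday',
    'Quarterly review agenda',
    'Scrum agenda Monday',
]
train_folders = ['alerts'] * 3 + ['calendar'] * 3
test_subjects = ['Throttling alerts', 'Monday agenda']
test_folders = ['alerts', 'calendar']

# Same bag-of-words and TF/IDF steps as above, followed by the classifier.
baseline = make_pipeline(
    CountVectorizer(stop_words='english', lowercase=True),
    TfidfTransformer(),
    LogisticRegression(),
)
baseline.fit(train_subjects, train_folders)

baseline_accuracy = metrics.accuracy_score(
    test_folders, baseline.predict(test_subjects))
print(baseline_accuracy)
```

On the real data, the same estimator could be appended to candidate_classifiers so that it is scored on the identical train/test split.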

In [13]:
# Construct a list of classifiers
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

dense_names = [
    "RBF SVM", 
#     "Gaussian Process", # Taking way too long
    "Naive Bayes",
#     "QDA" # Didn't work for classes with only one sample
]

candidate_dense_classifiers = [
    SVC(gamma=2, C=1),
#     GaussianProcessClassifier(1.0 * RBF(1.0), warm_start=True),
    GaussianNB(),
#     QuadraticDiscriminantAnalysis()
]
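
These classifiers are fit on dense arrays in the next cell, so it is worth estimating the memory cost of converting the sparse matrices first. A back-of-the-envelope calculation, assuming 8-byte float64 values and the split shapes printed earlier:

```python
# Shapes taken from the train/test split output above.
n_train_rows, n_test_rows, n_features = 8236, 2059, 6649
bytes_per_value = 8  # assumption: float64 entries

train_bytes = n_train_rows * n_features * bytes_per_value
test_bytes = n_test_rows * n_features * bytes_per_value

print(train_bytes / 1e6)  # dense training matrix, in megabytes
print(test_bytes / 1e6)   # dense test matrix, in megabytes
```

That comes to roughly 438 MB for the training matrix and 110 MB for the test matrix, nearly all of it zeros. It is manageable at this scale but grows quickly with more messages or a larger vocabulary.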

In [14]:
# Train and evaluate models using dense feature matrix, compare accuracy
from sklearn import metrics
dense_features_train = features_train.toarray()
dense_features_test = features_test.toarray()
for name, clf in zip(dense_names, candidate_dense_classifiers):
    model = clf.fit(dense_features_train, labels_train)
    predictions = model.predict(dense_features_test)
    print name + ": " + str(metrics.accuracy_score(labels_test, predictions))


RBF SVM: 0.796988829529
Naive Bayes: 0.937348227295
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-64889003908f> in <module>()
      4 dense_features_test = features_test.toarray()
      5 for name, clf in zip (dense_names, candidate_dense_classifiers):
----> 6     model = clf.fit(dense_features_train, labels_train)
      7     predictions = model.predict(dense_features_test)
      8     print name + ": " + str(metrics.accuracy_score(labels_test, predictions))

/Users/strockis/Source/miniconda2/envs/smart-sort/lib/python2.7/site-packages/sklearn/discriminant_analysis.pyc in fit(self, X, y, store_covariances, tol)
    687             if len(Xg) == 1:
    688                 raise ValueError('y has only 1 sample in class %s, covariance '
--> 689                                  'is ill defined.' % str(self.classes_[ind]))
    690             Xgc = Xg - meang
    691             # Xgc = U * S * V.T

ValueError: y has only 1 sample in class AAMkADNlYWY3MWVjLTMyYjgtNDg1Ny1hZTk4LWFkZGEyZmM4YzBjOAAuAAAAAACZhatKmZhBQaIh_GuBK5qjAQALOo6CFxH4Rb3A38IMpKY5AAAFEg0eAAA=, covariance is ill defined.
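
The error arises because QDA estimates a covariance matrix per class, which is undefined for a class with a single sample. One possible workaround, sketched here on hypothetical labels, is to drop classes with fewer than two samples before fitting. This is an assumption about how one might proceed, not something the notebook does.

```python
import pandas as pd

# Hypothetical labels; 'rare' appears only once, like the folder in the error.
toy_labels = pd.Series(['inbox', 'inbox', 'alerts', 'alerts', 'alerts', 'rare'])

# Count the samples per class, then keep rows whose class has at least two.
label_counts = toy_labels.value_counts()
keep_mask = toy_labels.map(label_counts) >= 2

filtered_labels = toy_labels[keep_mask]
print(filtered_labels.tolist())
```

The same boolean mask (keep_mask.values) could be used to select the matching rows of the feature matrix. The filter would need to be applied to the training split itself, since a class with two samples overall can still end up with one after splitting.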

Conclusions

  • Models which probably deserve more investigation & tuning (in order):
    • Multiclass logistic regression
    • Naive Bayes
    • Nearest neighbors
    • Neural networks
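  • Tree-based models (decision tree, random forest, AdaBoost) perform very poorly here, at roughly 0.08 to 0.18 accuracy; this may be down to my settings (max_depth=5, max_features=1) rather than the models themselves
  • Support vector machines trail the models above: both the linear and RBF kernels reach about 0.80 accuracy, versus 0.86 to 0.94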
  • Next steps: focus on quality of feature construction
